test video
Supplementary Material to Deep Contextual Video Compression
This document provides the supplementary material to our proposed deep contextual video compression (DCVC), including detailed network structures, training strategies, as well as additional experimental results to demonstrate the effectiveness of the proposed DCVC. The above part is the encoder and the below part is the decoder. For simplification, the entropy model is omitted. ResBlock represents plain residual block. The numbers represent channel dimensions.
Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion
Agarwal, Shruti, Hu, Liwen, Ng, Evonne, Darrell, Trevor, Li, Hao, Rohrbach, Anna
In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques. Such falsifications range from cheapfakes (e.g., lookalikes or audio dubbing) to deepfakes (e.g., sophisticated AI media synthesis methods), which are becoming perceptually indistinguishable from real videos. To tackle this challenge, we propose a multi-modal semantic forensic approach to discover clues that go beyond detecting discrepancies in visual quality, thereby handling both simpler cheapfakes and visually persuasive deepfakes. In this work, our goal is to verify that the purported person seen in the video is indeed themselves by detecting anomalous correspondences between their facial movements and the words they are saying. We leverage the idea of attribution to learn person-specific biometric patterns that distinguish a given speaker from others. We use interpretable Action Units (AUs) to capture a persons' face and head movement as opposed to deep CNN visual features, and we are the first to use word-conditioned facial motion analysis. Unlike existing person-specific approaches, our method is also effective against attacks that focus on lip manipulation. We further demonstrate our method's effectiveness on a range of fakes not seen in training including those without video manipulation, that were not addressed in prior work.
- Europe > Belgium (0.04)
- Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (3 more...)
End-to-End Deep Learning for Self-Driving Cars
In a new automotive application, we have used convolutional neural networks (CNNs) to map the raw pixels from a front-facing camera to the steering commands for a self-driving car. This powerful end-to-end approach means that with minimum training data from humans, the system learns to steer, with or without lane markings, on both local roads and highways. The system can also operate in areas with unclear visual guidance such as parking lots or unpaved roads. We designed the end-to-end learning system using an NVIDIA DevBox running Torch 7 for training. An NVIDIA DRIVETM PX self-driving car computer, also with Torch 7, was used to determine where to drive--while operating at 30 frames per second (FPS).
- North America > United States > New Jersey > Monmouth County (0.05)
- North America > United States > Pennsylvania (0.04)
- North America > United States > New York (0.04)
- (2 more...)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Information Technology (1.00)
- Automobiles & Trucks (1.00)
AGENT: A Benchmark for Core Psychological Reasoning
Shu, Tianmin, Bhandwaldar, Abhishek, Gan, Chuang, Smith, Kevin A., Liu, Shari, Gutfreund, Dan, Spelke, Elizabeth, Tenenbaum, Joshua B., Ullman, Tomer D.
For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human-ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (2 more...)
Learning Grammar of Complex Activities via Deep Neural Networks
Motivated by the growing amount of publicly available video data on online streaming services and an increased interest in applications that analyze continuous video streams such as autonomous driving, this technical report provides a theoretical insight into deep neural networks for video learning, under label constraints. I build upon previous work in video learning for computer vision, make observations on model performance and propose further mechanisms to help improve our observations.
End-to-End Deep Learning for Self-Driving Cars
In a new automotive application, we have used convolutional neural networks (CNNs) to map the raw pixels from a front-facing camera to the steering commands for a self-driving car. This powerful end-to-end approach means that with minimum training data from humans, the system learns to steer, with or without lane markings, on both local roads and highways. The system can also operate in areas with unclear visual guidance such as parking lots or unpaved roads. We designed the end-to-end learning system using an NVIDIA DevBox running Torch 7 for training. An NVIDIA DRIVETM PX self-driving car computer, also with Torch 7, was used to determine where to drive--while operating at 30 frames per second (FPS).
- North America > United States > New Jersey > Monmouth County (0.05)
- North America > United States > Pennsylvania (0.04)
- North America > United States > New York (0.04)
- (2 more...)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (1.00)
Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection
Chang, Xiaojun (University of Technology Sydney) | Yang, Yi (University of Technology Sydney) | Hauptmann, Alexander (Carnegie Mellon University) | Xing, Eric P (Carnegie Mellon University) | Yu, Yao-Liang (Carnegie Mellon University)
We focus on detecting complex events in unconstrained Internet videos. While most existing works rely on the abundance of labeled training data, we consider a more difficult zero-shot setting where no training data is supplied. We first pre-train a number of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest. After further refinement to take prediction inaccuracy and discriminative power into account, we apply the discovered concept classifiers on all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices and the nuclear norm rank aggregation framework is adopted to seek consensus. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach. We focus on detecting complex events in unconstrained Internet videos. While most existing works rely on the abundance of labeled training data, we consider a more difficult zero-shot setting where no training data is supplied.We first pre-train a number of concept classifiers using data from other sources. Then we evaluate the semantic correlation of each concept w.r.t. the event of interest. After further refinement to take prediction inaccuracy and discriminative power into account, we apply the discovered concept classifiers on all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices and the nuclear norm rank aggregation framework is adopted to seek consensus. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.83)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.65)